Revisiting amino acid substitution matrices for identifying distantly related proteins
Abstract
MOTIVATION: Although many amino acid substitution matrices have been developed, it is still not well understood which one performs best in similarity searches, especially for remote homology detection. We therefore collected information on existing matrices, condensed it, and derived a novel matrix that detects more remote homologs than the existing matrices.

RESULTS: Using principal component analysis of existing matrices together with benchmarks, we developed a novel matrix, which we designate MIQS. We validated the detection performance of MIQS and compared it with that of existing general-purpose matrices using SSEARCH, with gap penalties optimized for each matrix. On an independent dataset, MIQS detected more remote homologs than the existing matrices. Its performance was also superior to that of CS-BLAST, a similarity search method that does not use an amino acid substitution matrix. We also evaluated the alignment quality of the matrices and methods, which showed that MIQS achieves higher alignment sensitivity than the existing matrix series and CS-BLAST. These results indicate that amino acid substitution matrices remain useful and important in sequence analysis. Moreover, our matrix could further improve more sophisticated similarity search methods, such as sequence-profile and profile-profile comparison.

AVAILABILITY AND IMPLEMENTATION: The newly developed matrices and the datasets used for this study are available at http://csas.cbrc.jp/Ssearch/.
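The idea of applying principal component analysis to a collection of substitution matrices can be illustrated with a minimal sketch. This is not the authors' procedure: the matrices below are hypothetical toy data generated in the code, the helper names (`flatten_upper`, `first_principal_component`, `toy_matrix`) are invented for illustration, and only the first component is estimated, by plain power iteration. Each symmetric 20x20 matrix is treated as one observation with 210 unique entries.

```python
import math
import random


def flatten_upper(matrix):
    """Return the unique entries (upper triangle) of a symmetric matrix."""
    size = len(matrix)
    return [matrix[i][j] for i in range(size) for j in range(i, size)]


def first_principal_component(rows, iterations=100, seed=0):
    """Estimate the leading principal axis of `rows` by power iteration."""
    n_features = len(rows[0])
    means = [sum(row[k] for row in rows) / len(rows) for k in range(n_features)]
    centered = [[row[k] - means[k] for k in range(n_features)] for row in rows]
    rng = random.Random(seed)
    axis = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
    for _ in range(iterations):
        # One step of axis <- X^T (X axis), followed by normalization.
        scores = [sum(a * x for a, x in zip(axis, row)) for row in centered]
        axis = [sum(s * row[k] for s, row in zip(scores, centered))
                for k in range(n_features)]
        norm = math.sqrt(sum(a * a for a in axis))
        axis = [a / norm for a in axis]
    # Coordinates of each matrix along the estimated axis.
    scores = [sum(a * x for a, x in zip(axis, row)) for row in centered]
    return axis, scores


def toy_matrix(level, rng, size=20):
    """Hypothetical symmetric 20x20 'substitution matrix' (not real data)."""
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i, size):
            base = 4.0 if i == j else -1.0
            trend = level * (1.0 if i == j else -0.5)
            value = base + trend + rng.uniform(-0.01, 0.01)
            matrix[i][j] = matrix[j][i] = value
    return matrix


rng = random.Random(1)
toy_matrices = [toy_matrix(level, rng) for level in range(5)]
axis, scores = first_principal_component([flatten_upper(m) for m in toy_matrices])
print([round(s, 2) for s in scores])
```

In this toy setup the matrices differ mainly along one built-in trend, so their scores on the first component order them along that trend. With real matrices, the component coordinates would instead have to be related to benchmark performance, which this sketch does not attempt.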
Related articles
Inconsistent Distances in Substitution Matrices can be Avoided by Properly Handling Hydrophobic Residues
The adequacy of substitution matrices to model evolutionary relationships between amino acid sequences can be numerically evaluated by checking the mathematical property of triangle inequality for all triplets of residues. By converting substitution scores into distances, one can verify that a direct path between two amino acids is shorter than a path passing through a third amino acid in the a...
Periodic distributions of hydrophobic amino acids allows the definition of fundamental building blocks to align distantly related proteins.
Several studies on large and small families of proteins proved in a general manner that hydrophobic amino acids are globally conserved even if they are subjected to high rate substitution. Statistical analysis of amino acids evolution within blocks of hydrophobic amino acids detected in sequences suggests their usage as a basic structural pattern to align pairs of proteins of less than 25% sequ...
Statistical potential-based amino acid similarity matrices for aligning distantly related protein sequences.
Aligning distantly related protein sequences is a long-standing problem in bioinformatics, and a key for successful protein structure prediction. Its importance is increasing recently in the context of structural genomics projects because more and more experimentally solved structures are available as templates for protein structure modeling. Toward this end, recent structure prediction methods...
Nucleotide substitution and recombination at orthologous loci in Staphylococcus aureus.
The pattern of nucleotide substitution was examined at 2,129 orthologous loci among five genomes of Staphylococcus aureus, which included two sister pairs of closely related genomes (MW2/MSSA476 and Mu50/N315) and the more distantly related MRSA252. A total of 108 loci were unusual in lacking any synonymous differences among the five genomes; most of these were short genes encoding proteins hig...
Searching for frameshift evolutionary relationships between protein sequence families.
The protein sequence database was analyzed for evidence that some distinct sequence families might be distantly related in evolution by changes in frame of translation. Sequences were compared using special amino acid substitution matrices for the alternate frames of translation. The statistical significance of alignment scores were computed in the true database and shuffled versions of the dat...